set.seed(10)
x<-matrix(rnorm(100*2), ncol=2)
x[1:50,]<-x[1:50,]+2
x[51:75,]<-x[51:75,]-2
y<-c(rep(1,75),rep(2,25))
plot(x, col=y)
dat<-data.frame(x=x,y=as.factor(y)) # encode response as factor
train<-sample(100,50)
By plotting the data, we can see whether the two classes are linearly separable. They do not appear to be. We will now fit a support vector classifier.
library(e1071)
svmfit<-svm(y~.,data=dat[train,],kernel="linear",cost=10,scale=FALSE)
plot(svmfit,dat)
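Before tuning, it can be useful to see how many support vectors the fit uses. A self-contained sketch, rebuilding the data and split from above (summary() and the $index accessor are standard features of e1071 svm objects):

```r
library(e1071)

# Rebuild the data and train/test split exactly as above.
set.seed(10)
x <- matrix(rnorm(100 * 2), ncol = 2)
x[1:50, ] <- x[1:50, ] + 2
x[51:75, ] <- x[51:75, ] - 2
y <- c(rep(1, 75), rep(2, 25))
dat <- data.frame(x = x, y = as.factor(y))
train <- sample(100, 50)

svmfit <- svm(y ~ ., data = dat[train, ], kernel = "linear",
              cost = 10, scale = FALSE)
summary(svmfit)  # reports the number of support vectors in each class
svmfit$index     # row indices (within the training set) of the support vectors
```

A large number of support vectors here is another hint that the classes are not linearly separable.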
We use tune() to perform ten-fold cross-validation over a range of cost values and store the best model; summary(tune.out) reports the cross-validation error for each cost.
set.seed(1)
tune.out<-tune(svm,y~.,data=dat[train,],kernel="linear",ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod<-tune.out$best.model
Now we can predict the class label on a set of test observations.
table(true=dat[-train,"y"],pred=predict(bestmod,newdata=dat[-train,]))
## pred
## true 1 2
## 1 39 0
## 2 11 0
In this case the classifier predicts class 1 for every test observation, so 11 of the 50 test observations (22%) are misclassified.
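As a sanity check on that figure, the error rate is just the off-diagonal count of the confusion matrix divided by the total (counts taken from the table above):

```r
# Confusion matrix from the output above: rows = true class, cols = predicted.
conf <- matrix(c(39, 11, 0, 0), nrow = 2,
               dimnames = list(true = c("1", "2"), pred = c("1", "2")))
err <- 1 - sum(diag(conf)) / sum(conf)  # misclassified / total
err  # 0.22
```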
Moving on to the support vector machine, we fit with a radial kernel.
svmfit<-svm(y~.,data=dat[train,],kernel="radial",gamma=1,cost=1)
plot(svmfit,dat[train,])
plot(svmfit,dat[-train,])
This is cool: the plots show an apparent non-linear decision boundary. Now, let’s tune over both cost and gamma.
tune.out<-tune(svm,y~.,data=dat[train,],kernel="radial",ranges=list(cost=c(0.1,1,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 2
##
## - best performance: 0.06
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.28 0.25298221
## 2 1e+00 0.5 0.12 0.16865481
## 3 1e+01 0.5 0.10 0.14142136
## 4 1e+02 0.5 0.10 0.14142136
## 5 1e+03 0.5 0.12 0.10327956
## 6 1e-01 1.0 0.28 0.25298221
## 7 1e+00 1.0 0.12 0.13984118
## 8 1e+01 1.0 0.10 0.14142136
## 9 1e+02 1.0 0.10 0.10540926
## 10 1e+03 1.0 0.16 0.15776213
## 11 1e-01 2.0 0.28 0.25298221
## 12 1e+00 2.0 0.06 0.09660918
## 13 1e+01 2.0 0.06 0.09660918
## 14 1e+02 2.0 0.10 0.14142136
## 15 1e+03 2.0 0.18 0.17511901
## 16 1e-01 3.0 0.28 0.25298221
## 17 1e+00 3.0 0.08 0.10327956
## 18 1e+01 3.0 0.06 0.09660918
## 19 1e+02 3.0 0.12 0.10327956
## 20 1e+03 3.0 0.12 0.10327956
## 21 1e-01 4.0 0.28 0.25298221
## 22 1e+00 4.0 0.10 0.14142136
## 23 1e+01 4.0 0.06 0.09660918
## 24 1e+02 4.0 0.12 0.10327956
## 25 1e+03 4.0 0.12 0.10327956
Per the summary, the best cost is 1 and the best gamma is 2. Now time to predict!
table(true=dat[-train,"y"],pred=predict(svmfit,newdata=dat[-train,]))
## pred
## true 1 2
## 1 30 9
## 2 8 3
In this case, (9 + 8)/50 = 34% of the test observations are misclassified, so the SVM is being out-performed by the support vector classifier even though the data appear to have a non-linear boundary that the radial kernel should capture. Note, however, that we predicted with svmfit (cost = 1, gamma = 1) rather than with the cross-validated best model (cost = 1, gamma = 2); predicting with tune.out$best.model should give a better test error.
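A self-contained sketch of predicting with the cross-validated model instead, rebuilding the data, split, and tuning run exactly as above (results depend on the seeds, so the error rate here is not guaranteed to match any particular value):

```r
library(e1071)

# Rebuild the data and train/test split as above.
set.seed(10)
x <- matrix(rnorm(100 * 2), ncol = 2)
x[1:50, ] <- x[1:50, ] + 2
x[51:75, ] <- x[51:75, ] - 2
y <- c(rep(1, 75), rep(2, 25))
dat <- data.frame(x = x, y = as.factor(y))
train <- sample(100, 50)

# Cross-validate the radial-kernel SVM as above.
set.seed(1)
tune.out <- tune(svm, y ~ ., data = dat[train, ], kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                               gamma = c(0.5, 1, 2, 3, 4)))

# Predict with the best cross-validated model. Note that predict() for
# svm objects takes `newdata`; an unrecognized `newx` argument is silently
# ignored, in which case the fitted training values are returned instead.
pred.best <- predict(tune.out$best.model, newdata = dat[-train, ])
table(true = dat[-train, "y"], pred = pred.best)
err.best <- mean(pred.best != dat[-train, "y"])  # test error rate
err.best
```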